HIGH-PRECISION MATRIX-VECTOR MULTIPLICATION
ON A CHARGE-MODE ARRAY WITH EMBEDDED DYNAMIC MEMORY 
AND STOCHASTIC METHOD THEREOF 

Related Applications 

The present patent application claims the benefit of priority from US provisional
application 60/430,605, filed on December 3, 2002.

Field of the Invention 

The invention is directed toward fast and accurate multiplication of long vec- 
tors with large matrices using analog and digital integrated circuits. This applies to 
efficient computing of discrete linear transforms, as well as to other signal processing 
applications. 

Background of the Invention

The computational core of a vast number of signal processing and pattern recognition
algorithms is matrix-vector multiplication (MVM):

$$Y_m = \sum_{n=0}^{N-1} W_{mn} X_n \qquad \text{(Eq. 1)}$$

with N-dimensional input vector X, M-dimensional output vector Y, and M x N matrix
elements $W_{mn}$. In engineering, MVM can generally represent any discrete linear
transformation, such as a filter in signal processing, or a recall in neural networks.
Fast and accurate matrix-vector multiplication of large matrices presents a significant
technical challenge.

Conventional general-purpose processors and digital signal processors (DSP) 
lack parallelism needed for efficient real-time implementation of MVM in high di- 
mensions. Multiprocessors and networked parallel computers in principle are capable 
of high throughput, but are costly, and impractical for low-cost embedded real-time 
applications. Dedicated parallel VLSI architectures have been developed to speed up 
MVM computation. The problem with most parallel systems is that they require centralized
memory resources, i.e., memory shared on a bus, thereby limiting the available
throughput. A fine-grain, fully parallel architecture that integrates memory and
processing elements yields high computational throughput and high density of integration
[J.C. Gealow and C.G. Sodini, "A Pixel-Parallel Image Processor Using Logic
Pitch-Matched to Dynamic Memory," IEEE J. Solid-State Circuits, vol. 34, pp. 831-839,
1999]. The ideal scenario (in the case of matrix-vector multiplication) is where
each processor performs one multiply and locally stores one coefficient. The advantage
of this is a throughput that scales linearly with the dimensions of the implemented
array. The recurring problem with digital implementation is the latency in accumulating
the result over a large number of cells. Also, the extensive silicon area and power
dissipation of a digital multiply-and-accumulate implementation make this approach
prohibitive for very large (1,000-10,000) matrix dimensions.

Analog VLSI provides a natural medium to implement fully parallel computa- 
tional arrays with high integration density and energy efficiency [A. Kramer, "Array- 
based analog computation," IEEE Micro, vol. 16 (5), pp. 40-49, 1996]. By summing 
charge or current on a single wire across cells in the array, low latency is intrinsic. 
Analog multiply-and-accumulate circuits are so small that one can be provided for
each matrix element, making massively parallel implementations with large matrix
dimensions feasible. Fully parallel implementation of (Eq. 1) requires
an M x N array of cells, illustrated in Figure 1. Each cell (m,n) (101) computes
the product of input component $X_n$ (102) and matrix element $W_{mn}$ (104), and dumps
the resulting current or charge on a horizontal output summing line (103). The device
storing $W_{mn}$ is usually incorporated into the computational cell to avoid
performance limitations due to low external memory access bandwidth. Various physical
representations of inputs and matrix elements have been explored, using charge-mode
(US Patent 5,089,983 to Chiang; US Patent 5,258,934 to Agranat et al.; US
Patent 5,680,515 to Barhen et al.), transconductance-mode [F. Kub, K. Moon, I. Mack,
F. Long, "Programmable analog vector-matrix multipliers," IEEE Journal of Solid-State
Circuits, vol. 25 (1), pp. 207-214, 1990], [G. Cauwenberghs, C.F. Neugebauer
and A. Yariv, "Analysis and Verification of an Analog VLSI Incremental Outer-Product
Learning System," IEEE Trans. Neural Networks, vol. 3 (3), pp. 488-497, May 1992],
or current-mode [A.G. Andreou, K.A. Boahen, P.O. Pouliquen, A. Pavasovic, R.E. Jenkins,
and K. Strohbehn, "Current-Mode Subthreshold MOS Circuits for Analog VLSI
Neural Systems," IEEE Transactions on Neural Networks, vol. 2 (2), pp. 205-213,
1991] multiply-and-accumulate circuits.

A hybrid analog-digital technology for fast and accurate charge-based matrix-vector
multiplication (MVM) was invented by Barhen et al. in US Patent 5,680,515.
The approach combines the computational efficiency of analog array processing with
the precision of digital processing and the convenience of a programmable and
reconfigurable digital interface. The digital representation is embedded in the analog array
architecture, with inputs presented in bit-serial fashion, and matrix elements stored
locally in bit-parallel form:

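$$W_{mn} = \sum_{i=0}^{I-1} 2^{-i-1} w_{mn}^{(i)} \qquad \text{(Eq. 2)}$$

$$X_n = \sum_{j=0}^{J-1} 2^{-j-1} x_n^{(j)} \qquad \text{(Eq. 3)}$$

decomposing (Eq. 1) into:

$$Y_m = \sum_{n=0}^{N-1} W_{mn} X_n = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} 2^{-i-j-2} \, Y_m^{(i,j)} \qquad \text{(Eq. 4)}$$

with binary-binary MVM partials:

$$Y_m^{(i,j)} = \sum_{n=0}^{N-1} w_{mn}^{(i)} x_n^{(j)} \qquad \text{(Eq. 5)}$$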

The key is to compute and accumulate the binary-binary partial products (Eq. 5) using
an analog MVM array, quantize them, and to combine the quantized results

$$Q_m^{(i,j)} \approx \sum_{n=0}^{N-1} w_{mn}^{(i)} x_n^{(j)} \qquad \text{(Eq. 6)}$$

according to (Eq. 4), now in the digital domain:

$$Q_m = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} 2^{-i-j-2} \, Q_m^{(i,j)} \qquad \text{(Eq. 7)}$$

Digital-to-analog conversion at the input interface is inherent in the bit-serial
implementation, and row-parallel analog-to-digital converters (ADCs) are used at the output
interface to quantize $Y_m^{(i,j)}$.
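The decomposition of (Eq. 1) into weighted binary-binary partials can be checked numerically. The following is a minimal sketch, assuming unsigned fixed-point operands stored as integers (W = w / 2^I, X = x / 2^J) and an ideal quantizer that returns each partial count exactly; the function names are illustrative and do not appear in the source.

```python
from fractions import Fraction


def to_bits(value, n_bits):
    """MSB-first bits of an unsigned integer, so bit k carries weight 2**-(k+1)."""
    return [(value >> (n_bits - 1 - k)) & 1 for k in range(n_bits)]


def mvm_direct(w_int, x_int, n_i, n_j):
    """Eq. 1 on fixed-point values W = w / 2**I and X = x / 2**J (exact)."""
    return [
        sum(Fraction(w, 2 ** n_i) * Fraction(x, 2 ** n_j) for w, x in zip(row, x_int))
        for row in w_int
    ]


def mvm_bit_decomposed(w_int, x_int, n_i, n_j):
    """Eq. 4: weighted sum of binary-binary partials (Eq. 5), ideal quantizer assumed."""
    x_bits = [to_bits(x, n_j) for x in x_int]            # x_bits[n][j]
    result = []
    for row in w_int:
        w_bits = [to_bits(w, n_i) for w in row]          # w_bits[n][i]
        total = Fraction(0)
        for i in range(n_i):        # bit planes of the stored matrix elements
            for j in range(n_j):    # input bits, presented one at a time
                partial = sum(wb[i] & xb[j] for wb, xb in zip(w_bits, x_bits))
                total += Fraction(partial, 2 ** (i + j + 2))
        result.append(total)
    return result
```

With I = J = 4, calling `mvm_bit_decomposed` on any integer matrix and vector with entries in 0..15 returns the same exact fractions as `mvm_direct`; in hardware the loop over i would run in parallel across rows of cells and the loop over j sequentially in time.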

The bit-serial format of the inputs (Eq. 3) was first proposed by Agranat et al. in
US Patent 5,258,934, with binary-analog partial products using analog matrix elements
for higher density of integration. The use of binary encoded matrix elements (Eq. 2)
relaxes precision requirements and simplifies storage, as was described by Barhen et al.
in US Patent 5,680,515. A number of signal processing applications mapped onto such
an architecture were given by Fijany et al. in US Patent 5,508,538 and Neugebauer in
US Patent 5,739,803. A charge injection device (CID) can be used as a unit computation
cell in such an architecture, as in US Patent 4,032,903 to Weimer and US Patent
5,258,934 to Agranat et al.

To conveniently implement the partial products (Eq. 5), the binary encoded matrix
elements $w_{mn}^{(i)}$ (201) are stored in bit-parallel form, and the binary encoded inputs
$x_n^{(j)}$ (202) are presented in bit-serial fashion as shown in Figure 2. The figure presents
the block diagram of one row in the matrix with binary encoded elements $w_{mn}^{(i)}$, for
a single m and with I = 4 bits, and the data flow of bit-serial inputs $x_n^{(j)}$ and
corresponding partial outputs $Y_m^{(i,j)}$, with J = 4 bits. Analog partial products (203) (Eq. 5)
are quantized and combined together in the analog-to-digital conversion block (204)
to produce the output $Q_m$ (Eq. 7). Figure 2 depicts a detailed block diagram of one
slice (301) of the top level architecture based on US Patent 5,680,515 to Barhen et al.,
outlined with a dashed line in Figure 3.

Despite the success of adaptive algorithms and architectures in reducing the ef- 
fect of analog component mismatch and noise on system performance, the precision 
and repeatability of analog VLSI computation under process and environmental vari- 
ations is inadequate for many applications. A need still exists therefore for fast and 
high-precision matrix-vector multipliers for very large matrices. 

Summary of the Invention 

It is one objective of the present invention to offer a charge-based apparatus to
efficiently multiply large vectors and matrices in parallel, with integrated and
dynamically refreshed storage of the matrix elements. The present invention is embodied in
a massively parallel, internally analog, externally digital electronic apparatus for
dedicated array processing that outperforms purely digital approaches by a factor of 100
to 10,000 in throughput, density and energy efficiency. A three-transistor unit cell
combines a single-bit dynamic random-access memory (DRAM) and a charge injection
device (CID) binary multiplier and analog accumulator. High cell density and
computation accuracy are achieved by decoupling the switch and input transistors. Digital
multiplication of variable resolution is obtained with bit-serial inputs and bit-parallel
storage of matrix elements, by combining quantized outputs from multiple rows of
cells over time. Use of dynamic memory eliminates the need for external storage of
matrix coefficients and their reloading.

It is another objective of the present invention to offer a method to improve resolution
of charge-based and other large-scale matrix-vector multipliers through stochastic
encoding of vector inputs. The present invention is also embodied in a stochastic
scheme exploiting Bernoulli random statistics of binary vectors to enhance digital
resolution of matrix-vector computation. The largest gains in system precision are obtained
for high input dimensions. The framework allows operation at full digital resolution
with relatively imprecise analog hardware, and with minimal cost in implementation
complexity to randomize the input data.

Description of Drawings 

Figure 1. General architecture for fully parallel matrix-vector multiplication.

Figure 2. Block diagram of one row in the matrix with binary encoded elements
and data flow of bit-serial inputs.

Figure 3. Top level architecture of a matrix-vector multiplying processor.

Figure 4. Circuit diagram of CID computational cell with integrated DRAM storage
(top) and charge transfer diagram for active write and compute operations (bottom).

Figure 5. Two charge-mode AND cells configured as an exclusive-OR (XOR)
multiply-and-accumulate gate.

Figure 6. Two charge-mode AND cells with inputs time-multiplexed on the same
node, configured as an exclusive-OR (XOR) multiply-and-accumulate gate.

Figure 7. A single row of the analog array in the stochastic architecture with
Bernoulli modulated signed binary inputs and fixed signed weights.

Figure 8. Output of a single row of the analog array, $Y_m^{(i,j)}$ (bottom), and its
probability distribution (top) in the stochastic architecture with Bernoulli
encoded inputs.

Figure 9. Input modulation and output reconstruction scheme in the stochastic MVM
architecture.

Detailed Description 



The present invention enhances precision and density of the integrated matrix- 
vector multiplication architectures by using a more accurate and simpler CID/DRAM 
computational cell, and a stochastic input modulation scheme that exploits Bernoulli
random statistics of binary vectors. 

CID/DRAM Cell 

The circuit diagram and operation of the unit cell in the analog array are given
in Figure 4. It combines a CID computational element (411) with a DRAM storage
element (410). The cell stores one bit of a matrix element $w_{mn}^{(i)}$, performs a
one-quadrant binary-binary multiplication of $w_{mn}^{(i)}$ and $x_n^{(j)}$ in (Eq. 5), and accumulates
the result across cells with common m and i indices. An array of cells thus performs
(unsigned) binary multiplication (Eq. 5) of matrix $w_{mn}^{(i)}$ and vector $x_n^{(j)}$, yielding
$Y_m^{(i,j)}$ for values of i in parallel across the array, and values of j in sequence over
time.

The cell contains three MOS transistors connected in series as depicted in Figure 4.
Transistors M1 (401) and M2 (402) comprise a dynamic random-access memory
(DRAM) cell, with switch M1 controlled by Row Select signal $RS_m^{(i)}$ on line (405).
When activated, the binary quantity $w_{mn}^{(i)}$ is written in the form of charge (either $\Delta Q$
or 0) stored under the gate of M2. Transistors M2 (402) and M3 (403) in turn comprise
a charge injection device (CID), which by virtue of charge conservation moves electric
charge between two potential wells in a non-destructive manner.

The bottom diagram in Figure 4 depicts the charge transfer timing diagram for
write and compute operations. The cell operates in two phases: Write/Refresh and
Compute. When a matrix element value is being stored, $x_n^{(j)}$ is held at 0 V and $V_{out}$
at a voltage $V_{dd}/2$. To perform a write operation, either an amount of electric charge
is stored under the gate of M2, if $w_{mn}^{(i)}$ is low, or charge is removed, if $w_{mn}^{(i)}$ is high.
The charge (408) left under the gate of M2 can only be redistributed between the two
CID transistors, M2 and M3. An active charge transfer (409) from M2 to M3 can only
occur if there is non-zero charge (412) stored, and if the potential on the gate of M3
rises above that of M2 as illustrated in the bottom of Figure 4. This condition implies
a logical AND, i.e., unsigned binary multiplication, of $w_{mn}^{(i)}$ on line (404) and $x_n^{(j)}$
on line (406). The multiply-and-accumulate operation is then completed by capacitively
sensing the amount of charge transferred off the electrode of M2, the output
summing node (407). To this end, the voltage on the output line, left floating after
being pre-charged to $V_{dd}/2$, is observed. When the charge transfer is active, the cell
contributes a change in voltage $\Delta V_{out} = \Delta Q / C_{M2}$, where $C_{M2}$ is the total capacitance
on the output line across cells. The total response is thus proportional to the number
of actively transferring cells. After deactivating the input $x_n^{(j)}$, the transferred charge
returns to the storage node M2. The CID computation is non-destructive and intrinsically
reversible [C. Neugebauer and A. Yariv, "A Parallel Analog CCD/CMOS Neural
Network IC," Proc. IEEE Int. Joint Conference on Neural Networks (IJCNN'91), Seattle,
WA, vol. 1, pp. 447-451, 1991], and DRAM refresh is only required to counteract
junction and subthreshold leakage.
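The compute phase of one row can be summarized by a simple behavioral model. The sketch below is an idealization under stated assumptions: every cell transfers the same charge packet, the output-line capacitance is constant, and noise, leakage and clock feedthrough are ignored. The names `row_response`, `delta_q` and `c_out` are hypothetical, and only the magnitude of the voltage change is modeled.

```python
def row_response(w_bits, x_bits, delta_q, c_out):
    """Return (active_cells, delta_v) for one row during the compute phase.

    A cell transfers its charge packet only when its stored bit and its
    input bit are both 1 (logical AND). delta_v is the magnitude of the
    voltage change on the floating output line; its sign is not modeled.
    """
    if len(w_bits) != len(x_bits):
        raise ValueError("stored bits and input bits must have equal length")
    active_cells = sum(w & x for w, x in zip(w_bits, x_bits))
    delta_v = active_cells * delta_q / c_out
    return active_cells, delta_v
```

In this model the output scales linearly with the number of actively transferring cells, which is the quantity that the row-parallel ADC would quantize.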

In one possible embodiment of the present invention, the gate of M2 is the output
node and the gate of M3 is the input node. This configuration allows for simplified
peripheral array circuitry, as the potential on the bit-line $w_{mn}^{(i)}$ is a truly digital signal
driven to either 0 or $V_{dd}$. The signal-to-noise ratio of the cell presented in this
invention is superior due to the fact that the potential well corresponding to M3 is twice
as deep as that of M2.

In another possible embodiment of the present invention, to improve linearity
and to reduce sensitivity to clock feedthrough, differential encoding of input and stored
bits in the CID/DRAM architecture is implemented using twice the number of columns (501)
and unit cells (502), as shown in Figure 5. This amounts to exclusive-OR (503)
(XOR), rather than AND, multiplication on the analog array, using signed, rather than
unsigned, binary values for inputs and weights, $x_n^{(j)} = \pm 1$ and $w_{mn}^{(i)} = \pm 1$.
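How two AND cells per element can yield a signed product may be illustrated as follows. This is a sketch under an assumed encoding, not the circuit itself: each signed value s = ±1 is represented by the bit b = (s + 1) / 2 together with its complement, one AND cell sees the true bits and the other the complemented bits, and the constant offset N is removed digitally. Which logic level maps to +1 is a convention chosen here for illustration.

```python
def signed_inner_product_via_and_cells(w_signed, x_signed):
    """Signed (+1/-1) inner product built from pairs of unsigned AND cells."""
    matches = 0
    for w, x in zip(w_signed, x_signed):
        wb, xb = (w + 1) // 2, (x + 1) // 2      # map -1 -> 0, +1 -> 1
        cell_true = wb & xb                      # AND cell on the true bits
        cell_comp = (1 - wb) & (1 - xb)          # AND cell on the complemented bits
        matches += cell_true + cell_comp         # 1 when the signed product is +1
    # Each matching pair contributes +1 and each mismatch -1.
    return 2 * matches - len(w_signed)
```

The result equals the ordinary sum of products of the ±1 values, so the row output of the differential array differs from the signed inner product only by a fixed scale and offset.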

In another possible embodiment of the present invention, a more compact implementation
of the signed multiply-and-accumulate operation is possible with the CID/DRAM
cell, because the switch transistor M1 and input transistor M3 are decoupled by transistor M2
and can be multiplexed on the same wire. Both input and storage operations can be
time-multiplexed on a single wire (601) as shown in Figure 6. The cell pitch in the
array is then limited only by the width of a single bit-line metal layer, allowing for a very
dense array design.

Resolution Enhancement Through Stochastic Encoding 

Since the analog inner product (Eq. 5) is discrete, zero error can be achieved (as
if computed digitally) by matching the quantization levels of the ADC with each of the
N + 1 discrete levels in the inner product. Perfect reconstruction of $Y_m^{(i,j)}$ from the
quantized output, for an overall resolution of $I + J + \log_2(N + 1)$ bits, assumes the
combined effect of noise and nonlinearity in the analog array and the ADC is within
one LSB (least significant bit). For large arrays, this places stringent requirements on
analog computation precision and ADC resolution, $L \ge \log_2(N + 1)$.
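As a worked illustration of this bit count, take I = J = 4 bits and an input dimension of N = 1023. The value of N is an assumption chosen here only to give round numbers; it is not taken from the text.

```latex
\begin{aligned}
\log_2(N + 1) &= \log_2 1024 = 10 \quad\Rightarrow\quad L \ge 10 \text{ bits of ADC resolution},\\
I + J + \log_2(N + 1) &= 4 + 4 + 10 = 18 \text{ bits of overall resolution}.
\end{aligned}
```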

In what follows, signed, rather than unsigned, binary values for inputs and weights,
$x_n^{(j)} = \pm 1$ and $w_{mn}^{(i)} = \pm 1$, are assumed. This translates to exclusive-OR (XOR),
rather than AND, multiplication on the analog array, an operation that can be easily
accomplished with the CID/DRAM architecture by differentially coding input and stored
bits using twice the number of columns and unit cells as shown in Figures 5 and 6. A
single row of such a differential architecture is depicted in Figure 7.

The implicit assumption is that all quantization levels are (equally) needed. Anal- 
ysis of the statistics of the inner product reveals that this is poor use of available re- 
sources. The principle outlined below extends to any analog matrix-vector multiplier 
that assumes signed binary inputs and weights. 

For input bits $x_n^{(j)}$ (701) that are Bernoulli (i.e., fair coin flips) distributed, and
fixed signed binary coefficients $w_{mn}^{(i)}$ (702), the (XOR) product terms $w_{mn}^{(i)} x_n^{(j)}$ (703)
in (Eq. 5) are Bernoulli distributed, regardless of $w_{mn}^{(i)}$. Their sum $Y_m^{(i,j)}$ (704) thus
follows a binomial distribution

$$\Pr\left(Y_m^{(i,j)} = 2k - N\right) = \binom{N}{k} p^k (1 - p)^{N-k} \qquad \text{(Eq. 8)}$$

with p = 0.5, k = 0, ..., N, which in the Central Limit $N \to \infty$ approaches a normal
distribution with zero mean and variance N. In other words, for random inputs in high
dimensions N the active range (or standard deviation) of the inner-product (704) (Eq. 5)
is $N^{1/2}$, a factor $N^{1/2}$ smaller than the full range N.
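The zero mean and variance N stated above can be confirmed exactly for any finite N. The sketch below assumes that the signed sum takes the value 2k - N when k of the N product terms equal +1, each with probability 1/2; the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def signed_sum_distribution(n):
    """Pr(Y = 2k - n) for k = 0..n, with each product term +1 or -1 at p = 1/2."""
    return {2 * k - n: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}


def mean_and_variance(dist):
    """Exact mean and variance of a discrete distribution {value: probability}."""
    mean = sum(y * p for y, p in dist.items())
    var = sum((y - mean) ** 2 * p for y, p in dist.items())
    return mean, var
```

For example, `mean_and_variance(signed_sum_distribution(64))` gives a mean of 0 and a variance of 64, so the standard deviation is 8 while the full range spans -64 to +64.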

Figure 8 illustrates the effect of Bernoulli distribution of the inputs on the statistics
of an array row output. It depicts the output of a single row of
the analog array, $Y_m^{(i,j)}$, and its probability density in the stochastic architecture with
Bernoulli encoded inputs. In the top diagram of Figure 8, $Y_m^{(i,j)}$ is a discrete random
variable with probability density approaching a normal distribution for large N. In the
Central Limit the standard deviation is proportional to the square root of the full range,
$N^{1/2}$. Reduction of the active range of the inner-product to $N^{1/2}$ allows the
effective resolution of the ADC to be relaxed by a factor proportional to $N^{1/2}$, as the
number of quantization levels is proportional to $N^{1/2}$, not N. This gain is especially beneficial
for parallel (flash) quantizers in the architecture shown in Figure 2, as their area
requirements grow exponentially with the number of bits. As the bottom diagram of
Figure 8 shows, Bernoulli modulation of inputs significantly relaxes the requirements on
the linearity of the analog addition (Eq. 5) by making non-linearity outside the reduced
active range irrelevant.

In principle, this allows the effective resolution of the ADC to be relaxed. However,
any reduction in conversion range will result in a small but non-zero probability of
overflow. In practice, the risk of overflow can be reduced to negligible levels with
a few additional bits in the ADC conversion range. An alternative strategy is to use
a variable resolution ADC which expands the conversion range on rare occurrences
of overflow (or, with stochastic input encoding, overflow detection could initiate a
different random draw).
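The trade-off between conversion range and overflow risk can be quantified from the binomial statistics of the signed row sum. The following sketch assumes ideal Bernoulli inputs with p = 1/2 and an ADC whose range is clipped symmetrically at ±`half_range`; both the function name and the clipping model are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def overflow_probability(n, half_range):
    """Exact Pr(|Y| > half_range) for Y = 2k - n with k ~ Binomial(n, 1/2)."""
    inside = sum(
        comb(n, k) for k in range(n + 1) if abs(2 * k - n) <= half_range
    )
    return 1 - Fraction(inside, 2 ** n)
```

For N = 1024 the standard deviation is 32, and a range of four standard deviations (`half_range = 128`) covers only 129 of the 1025 possible output levels while leaving an overflow probability below one in a thousand under these assumptions.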

Although most randomly selected patterns do not correlate with any chosen template,
patterns from the real world tend to correlate. The key is stochastic encoding of
the inputs, so as to randomize the bits presented to the analog array.

Randomizing an informative input while retaining the information is a futile
goal, and the present invention comprises a solution that approaches the ideal performance
within observable bounds, and with reasonable cost in implementation. Given
that "ideal" randomized inputs relax the ADC resolution by $(\log_2 N)/2$ bits, they necessarily
reduce the wordlength of the output by the same amount. To account for the lost bits in
the range of the output, one could increase the range of the "ideal" randomized input
by the same number of bits.

One possible stochastic encoding scheme that restores the range is to modulate
the input with a random number. For each I-bit input component $X_n$, pick a random
integer $U_n$ in the range $\pm(R - 1)$, and subtract it to produce a modulated input
$\tilde{X}_n = X_n - U_n$ with $\log_2 R$ additional bits. As one possible embodiment of the invention,
one could choose R to be $N^{1/2}$, leading to $(\log_2 N)/2$ additional bits in the input encoding.

It can be shown that for worst-case deterministic inputs $X_n$ the mean of the inner
product for $\tilde{X}_n$ is off at most by $\pm N^{1/2}$ from the origin.

Note that $U_n$ is uniformly distributed across its range, and therefore its binary
coefficients $u_n^{(j)}$ are Bernoulli random variables. Figure 9 illustrates this encoding method
for particular i and j. Two rows (901) of the array are shown. Truly Bernoulli inputs
$u_n^{(j)}$ (902) are fed into one row. The inputs of the other row are stochastically modulated
binary coefficients of the informative input $\tilde{x}_n = x_n - u_n$ (903). Inner-products (904)
of approximately normal distribution are computed on both rows. Their smaller active
range allows the requirements on the resolution of the quantizer (905) to be relaxed by a
factor $N^{1/2}$. The desired inner-products for $X_n$ (906) are retrieved by digitally adding the
inner-products obtained for $\tilde{X}_n$ and $U_n$. The random offset $U_n$ can be chosen once, so
its inner-product with the templates can be pre-computed upon initializing or programming
the array (in other words, the computation performed by the top row in Figure 9
takes place only once). The implementation cost is thus limited to component-wise
subtraction of $X_n$ and $U_n$, achieved using one full adder cell, one bit register, and
ROM (read-only memory) storage of the $u_n^{(j)}$ bits for every column of the array.
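The modulation and reconstruction steps can be sketched at the integer level. The code below is a behavioral illustration only: the analog row is replaced by an exact inner product, the weights are assumed to be ±1, and the helper names (`draw_offsets`, `inner`, `stochastic_mvm_row`) are hypothetical.

```python
import random


def draw_offsets(n, r, seed=None):
    """Random integers U_n, uniform over -(r - 1) .. +(r - 1)."""
    rng = random.Random(seed)
    return [rng.randint(-(r - 1), r - 1) for _ in range(n)]


def inner(w, v):
    """Plain inner product, standing in for one row of the array."""
    return sum(wn * vn for wn, vn in zip(w, v))


def stochastic_mvm_row(w, x, u, y_u=None):
    """Inner product of w and x, recovered from the modulated input x - u."""
    x_mod = [xn - un for xn, un in zip(x, u)]    # one subtraction per column
    y_mod = inner(w, x_mod)                      # row fed with modulated inputs
    if y_u is None:
        y_u = inner(w, u)                        # can be computed once and stored
    return y_mod + y_u                           # digital addition of both results
```

Because the inner product is linear, adding the result for the modulated input to the stored result for the offsets returns the inner product of the original input exactly; passing a pre-computed `y_u` mirrors the observation that the offset row needs to be evaluated only once.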



